THREE DIMENSIONAL OBJECT POSE ESTIMATION
WHICH EMPLOYS DENSE DEPTH INFORMATION 



This disclosure is based upon, and claims priority from, provisional U.S. 
Patent Application No. 60/123,329, filed March 8, 1999, and provisional U.S. 
Patent Application No. 60/124,158, filed March 10, 1999, the contents of which 
are incorporated herein by reference. 

Field of the Invention 

The present invention is generally directed to the field of computer vision, 
and more particularly to the automatic estimation and tracking of the pose, i.e., 
position and/or orientation, of an object within a video image. 

Background of the Invention 

The ability to accurately estimate the three-dimensional position and 
orientation of an object, based solely upon video images of the object, is of 
increasing interest in the field of computer vision. For example, interactive human 
interface applications require the ability to quickly and accurately track the pose of 
a user. Information regarding the user's body position must be at or near real- 
time, to adjust the display of the interface in a meaningful, timely manner. For 
instance, an application which displays a three-dimensional view of an object 
requires accurate tracking of the position and orientation of the user's head, in 
order to present the frames containing an image of the object from an appropriate 
perspective. 

In general, previous approaches to pose tracking often relied on assumed 
models of shape, to track motion in three dimensions from intensity data, i.e., 
image brightness. Other approaches have employed depth data in conjunction with 
the image brightness information to estimate pose. Direct parametric motion has 
also been explored for both rigid and affine models. In this approach, it is 
preferable to utilize constraints in the analysis of the image data, to reduce the
number of computations that are required to estimate the pose of a figure. A
comprehensive description of brightness constraints that are implied by the rigid
motion of an object was presented by Horn and Weldon, "Direct Methods for
Recovering Motion", International Journal of Computer Vision 2:51-76 (1988).
Image stabilization and object tracking using an affine model with direct image
intensity constraints is described in Bergen et al., "Hierarchical Model-Based
Motion Estimation", European Conference on Computer Vision, pages 237-252
(1992). This reference discloses the use of a coarse-to-fine algorithm to solve for
large motions.

The application of affine models to track the motion of a user's head, as
well as the use of non-rigid models to capture expression, is described in Black
and Yacoob, "Tracking and Recognizing Rigid and Non-Rigid Facial Motions
Using Local Parametric Models of Image Motion," International Conference on
Computer Vision (1995). This paper describes the use of a planar face-shape for
tracking gross head motion, which limits the accuracy and range of motion that
can be captured. A similar approach, using ellipsoidal shape models and
perspective projection, is described in Basu et al., "Motion Regularization for
Model-Based Head Tracking", International Conference on Pattern Recognition
(1996). The method described in this publication utilizes a pre-computed optic
flow representation, instead of direct brightness constraints. It explicitly recovers
rigid motion parameters, rather than an affine motion in the image plane. Rigid
motion is represented using Euler angles, which can pose certain difficulties at
singularities.

The tracking of articulated-body motion presents additional complexities
within the general field of pose estimation. A variety of different techniques have
been proposed for this particular problem. Some approaches use constraints from
widely separated views to disambiguate partially occluded motions, without
computing depth values. These approaches are described, for
example, in Yamamoto et al., "Incremental Tracking of Human Actions From
Multiple Views", Proc. IEEE CVPR, pages 2-7, Santa Barbara, CA (1998), and
Gavrila and Davis, "3D Model-Based Tracking of Humans in Action: A Multi-View
Approach", Proc. CVPR, pages 73-80, San Francisco, CA (June 1996).

The use of a twist representation for rigid motion, which is more stable and
efficient to compute, is described in Bregler and Malik, "Tracking People With
Twists and Exponential Maps", Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA (June
1998). This approach is especially suited to the estimation of chained articulated
motion. The estimation of twist parameters is expressed directly in terms of an
image brightness constraint with a scaled orthographic projection model. It
assumes a generic ellipsoidal model of object shape. To recover motion and
depth, constraints from articulation and information from multiple widely-spaced
camera views are used. This method is not able to estimate the rigid translation in
depth of an unconnected object, given a single view.

The techniques which exhibit the most robustness tend to fit the observed
motion data to a parametric model before assigning specific pointwise
correspondences between successive images. Typically, this approach results in
non-linear constraint equations which must be solved using iterative gradient
descent or relaxation methods, as described in Pentland and Horowitz, "Recovery
of Non-Rigid Motion and Structure", PAMI, 13(7), pp. 730-742 (July 1991), and
Lin, "Tracking Articulated Objects in Real-Time Range Image Sequences", Proc.
IEEE ICCV, Volume 1, pages 648-653, Greece (September 1999). The papers by
Bregler et al. and Yamamoto et al. provide notable exceptions to this general
trend. Both result in systems with linear constraint equations, that are created by
combining articulated-body models with dense optical flow.

In the approach suggested by Yamamoto et al., the constraints between
limbs are maintained by sequentially estimating the motion of each parent limb,
adjusting the hypothesized position of a child limb, and then estimating the further
motion of the child limb. This approach is conceptually simpler than the one
suggested by Bregler et al., but results in fewer constraints on the motion of the
parent limbs. In contrast, the method of Bregler et al. takes full advantage of the
information provided by child limbs, to further constrain the estimated motions of
the parents.

Both Yamamoto et al. and Bregler et al. use a first-order Taylor series
approximation to the camera-body rotation matrix, to reduce the number of
parameters that are used to represent this matrix. Furthermore, both use an
articulated model to generate depth values that are needed to linearize the mapping
from three-dimensional body motions to observed two-dimensional camera-plane
motions.

The various techniques which employ depth information to estimate pose
have typically utilized sparse depth data, e.g. representative sample points in an
image. Recent imaging techniques now make it possible to obtain dense depth
information, e.g. a depth value for all, or almost all, of the pixels in an image.
Furthermore, this data can be obtained at video rates, so that it is real-time, or
near real-time. It is an objective of the present invention to provide techniques for
estimating pose which employ dense depth data.

Summary of the Invention 

In accordance with one aspect of the present invention, dense depth data
that is obtained at real-time rates is employed to estimate the pose of an articulated
figure, using a model of connected patches. In another aspect of the invention, the
dense depth data is used in conjunction with image intensity information to
improve pose tracking performance. The range information is used to determine
the shape of an object, rather than assume a generic model or estimate structure
from motion. The shape data can be updated with each frame, offering a more
accurate representation across time than one which is provided by an initial, or
off-line, range scan. In accordance with this feature of the invention, a depth
constraint equation, which is a counterpart to the classic brightness change
constraint equation, is employed. Both constraints are used to jointly solve for
motion estimates.

By observing the change in depth directly, rather than inferring it from
intensity change over time, more accurate estimates of object motion can be
obtained, particularly for rotation out of the image plane and translation in depth.
Depth information is also less sensitive to illumination and shading effects than
intensity data as an object translates and rotates through space. Hence, the dense
depth data is frequently more reliable than the brightness information.

In the case of articulated joints, twist mathematics are used to capture the
motion constraints. Unlike previous approaches, a first-order Taylor series
expansion is not employed to approximate the body-rotation matrix. Rather, this
approximation is achieved by solving the constraints on a transformed parameter
set, and remapping the results into the original parameter set using a closed-form
non-linear function.

As a further feature of the invention, the brightness change constraint
equation and the depth change constraint equation are re-derived, using shifted foci
of expansions. This extension permits the constraints to be used on large motions
without iteration. The constraint matrices are also modified to decouple the
updates of body rotation and body translation from one another.



These and other features of the invention are described in greater detail 
hereinafter, with reference to various examples that are illustrated with the 
assistance of the accompanying figures. 

Brief Description of the Drawings 

Figure 1 is a schematic illustration of a pose estimation system in which the 
present invention can be employed; 

Figure 2 is a block diagram of an image processor in accordance with one
aspect of the invention;

Figures 3a and 3b are illustrations of a planar patch model for a single 
segment and two articulated segments of a figure, respectively; 

Figure 4 is a graph depicting the function for observed farness; and 

Figure 5 is a graph depicting the function for expected farness. 



To facilitate an understanding of the invention, it is described hereinafter 
with reference to its application in the context of estimating and tracking the pose 
of an articulated object, such as the human body, since this example presents a 
particularly interesting and useful application of the invention. It will be
appreciated, however, that the practical utility of the invention is not limited to this
particular situation. Rather, the invention can be employed with success in a
variety of different situations, such as the estimation of the pose of singular rigid 
objects. 

In the implementation of the invention, the pose of a figure is estimated by
determining a reference, or initial pose, and then tracking changes over successive
images to obtain an updated estimate of the pose. In the context of the following
description, the term "pose" is employed in a generic sense, to indicate either or
both of the position of the object within a three-dimensional space, and the
orientation of the object at that position, e.g., the rotational position of the object,
or individual components of the object, relative to three orthogonal axes. In other
words, the term "pose" encompasses motion in any or all of six degrees of
freedom, namely translation along three orthogonal axes and rotation about three
orthogonal axes. It will be appreciated that the invention can be applied to
estimate only the position of the object, without regard to its orientation, or only
the orientation of the object, without regard to its global position.

Figure 1 schematically illustrates one example of a pose estimation system
in which the present invention can be implemented. An object whose pose is to be
estimated and tracked, such as a human body 10, is located within the field of view
of a pair of spaced video cameras 12a and 12b, which provide a sequence of stereo
images of the object. One of the cameras, e.g., camera 12a, is designated as a
reference, or master, camera. Data describing the intensity of each of the pixels in
the image sensed by this camera is provided to an image processor 14. The image
data from both of the cameras is provided to a range processor 16. The range
processor computes the distance of each point on the object from a reference
position, such as the image plane of the reference camera, and provides this
information to the image processor 14. In general, each point on the object
corresponds to a pixel within the image sensed by the reference camera. In a
preferred embodiment of the invention, the stereo pair of cameras and the range
processor 16 constitute a video-rate range sensor of the type described in Woodfill
and Von Herzen, "Real-Time Stereo Vision on the PARTS Reconfigurable
Computer", Proceedings IEEE Symposium on Field-Programmable Custom
Computing Machines, Napa, CA, pages 242-250 (April 1997). In this type of
system, the correspondence between pixels in the respective images from the two
cameras 12a and 12b is computed according to the census algorithm, as described
in detail in Zabih and Woodfill, "Non-Parametric Local Transforms for
Computing Visual Correspondence", Proceedings of the Third European
Conference on Computer Vision, Stockholm, pages 151-158, May 1994. The
disclosures of both of these publications are incorporated herein by reference. The
output of such a system comprises a depth value for each pixel in the image from
the reference camera 12a, together with a confidence factor for the value.

The range information which is provided by such a system is dense data.
In the context of the present invention, "dense" data is understood to mean data
obtained from a predefined, fixed sampling grid that is independent of the
parameter being determined, such as the pose of the figure. In a preferred
implementation of the invention, the sample grid comprises the array of pixels on
the image sensing device, e.g. CCD, of the camera. In this case, each pixel
provides one sample of depth data. Rather than employ the data from every pixel,
it is also possible to obtain depth data with a different resolution by means of a
coarser, but regular, sampling grid that is established a priori, e.g. every 2nd
pixel, every 5th pixel, etc. The data is also dense in the temporal sense, i.e. it is
obtained at a regular sampling interval, such as a video frame rate.

Within the image processor 14, frame-to-frame changes in the image
intensity and/or depth information are used to track the motion of the object and
thereby estimate its pose. Three-dimensional motion of each point in space
induces a corresponding two-dimensional motion of the projection of that point
onto the image plane of the camera. In accordance with the invention, two
different approaches are employed to estimate the pose from the camera outputs.
One approach utilizes the range data alone, and employs a shape model to
determine the figure's pose. In the other approach, both brightness information
and range data are employed, without reference to a shape model. The pose
estimate is obtained by computing the velocity of each point on the object from
frame to frame. In the following description, capital letters are employed to
represent parameter values in the three-dimensional space, and lower case letters
are employed for the corresponding parameters in the two-dimensional image
plane. Hence, the coordinates of a point in three-dimensional space are expressed
as [X,Y,Z], and its corresponding point in the camera image plane is identified as
[x,y].

Before describing the manner in which dense depth data is employed to
estimate pose in accordance with the invention, a background discussion of the use
of brightness data in the image processor 14 is presented. The velocity of a point
in the video image is computed in accordance with a standard brightness change
constraint equation for image velocity estimation. This equation arises from the
assumption that intensities undergo only local translations from one frame to the
next in an image sequence. This assumption does not hold true for all points in the
image, however, since it ignores phenomena such as object self-occlusion, i.e., a
portion of the object disappears from view as it rotates away from the camera, and
changes in intensity due to changes in lighting, e.g., the object moves into a
shadow.

The brightness change constraint equation can be expressed as follows, for
image frames at times t and t+1:

$$B(x,y,t+1) = B(x - v_x,\; y - v_y,\; t) \tag{1}$$

where B(x,y,t) is the image brightness, or intensity, at time t, and $v_x$ and $v_y$ are
the motions in X and Y, after projection onto the image plane. If it is assumed
that the time-varying image intensity is well approximated by a first-order Taylor
series expansion, the right side of Equation 1 can be expanded to obtain:

$$B(x,y,t+1) = B(x,y,t) - v_x B_x(x,y,t) - v_y B_y(x,y,t) \tag{2}$$

where $B_x(x,y,t)$ and $B_y(x,y,t)$ are image intensity gradients with respect to x and y
as a function of space and time. Rearranging these terms into matrix form yields a
commonly employed gradient formulation of the brightness change constraint
equation:

$$-B_t = \begin{bmatrix} B_x & B_y \end{bmatrix} \begin{bmatrix} v_x \\ v_y \end{bmatrix} \tag{3}$$

This equation can be used to estimate image plane velocities, but 3-D real-world
velocities are desired. For a perspective camera with focal length f, the
relationship between the two sets of velocities may be derived from the perspective
camera projection equations $x = fX/Z$ and $y = fY/Z$. Taking the derivatives of
these equations with respect to time yields

$$v_x = \frac{dx}{dt} = \frac{f}{Z}V_X - \frac{x}{Z}V_Z, \qquad v_y = \frac{dy}{dt} = \frac{f}{Z}V_Y - \frac{y}{Z}V_Z \tag{4}$$

This can be written in matrix form as

$$\begin{bmatrix} v_x \\ v_y \end{bmatrix} = \frac{1}{Z}\begin{bmatrix} f & 0 & -x \\ 0 & f & -y \end{bmatrix}\begin{bmatrix} V_X \\ V_Y \\ V_Z \end{bmatrix} \tag{5}$$

The right side of the above equation can be substituted into Equation 3 to obtain
the constraint equation in terms of 3-D object velocities:

$$-B_t = \frac{1}{Z}\begin{bmatrix} B_x & B_y \end{bmatrix}\begin{bmatrix} f & 0 & -x \\ 0 & f & -y \end{bmatrix} V = \frac{1}{Z}\begin{bmatrix} fB_x & fB_y & -(xB_x + yB_y) \end{bmatrix} V \tag{6}$$
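The relationship in Equations 5 and 6 can be checked numerically. The following is a minimal sketch, assuming illustrative values for the focal length, the 3-D point, its velocity, and the intensity gradients; the function names are hypothetical and are not part of the disclosed system. It compares the image velocity predicted by Equation 5 against a finite-difference projection of the moved point, and confirms that the row vector of Equation 6 reproduces the right side of Equation 3.

```python
# Illustrative sketch only: names and numeric values are assumptions.

def project(X, f):
    """Perspective projection x = f*X/Z, y = f*Y/Z."""
    return (f * X[0] / X[2], f * X[1] / X[2])

def image_velocity(X, V, f):
    """Equation 5: [v_x, v_y] = (1/Z) [[f, 0, -x], [0, f, -y]] V."""
    x, y = project(X, f)
    Z = X[2]
    return ((f * V[0] - x * V[2]) / Z, (f * V[1] - y * V[2]) / Z)

def brightness_row(Bx, By, x, y, Z, f):
    """Row vector of Equation 6: (1/Z) [f*Bx, f*By, -(x*Bx + y*By)]."""
    return [f * Bx / Z, f * By / Z, -(x * Bx + y * By) / Z]

f = 500.0                    # assumed focal length in pixels
X = (0.2, -0.1, 2.0)         # assumed 3-D point [X, Y, Z]
V = (0.03, 0.01, -0.05)      # assumed 3-D velocity [V_X, V_Y, V_Z]

# Finite-difference image velocity for a very small time step.
dt = 1e-6
X_moved = tuple(Xi + Vi * dt for Xi, Vi in zip(X, V))
x0, y0 = project(X, f)
x1, y1 = project(X_moved, f)
numeric = ((x1 - x0) / dt, (y1 - y0) / dt)
analytic = image_velocity(X, V, f)

# The Equation 6 row applied to V equals [Bx, By] applied to [v_x, v_y].
Bx, By = 0.4, -0.7           # assumed intensity gradients at the pixel
row = brightness_row(Bx, By, x0, y0, X[2], f)
row_dot_V = sum(r * v for r, v in zip(row, V))
gradient_dot_v = Bx * analytic[0] + By * analytic[1]
```

The finite-difference and analytic velocities agree to within the step-size error, which is the sense in which Equation 5 is the time derivative of the projection equations.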



where $V = [V_X\; V_Y\; V_Z]^T$. The 3-D object velocities V are then further
constrained according to rigid body object motion. Any rigid body motion can be
expressed in terms of the instantaneous object translation $T = [t_X\; t_Y\; t_Z]^T$ and the
instantaneous rotation of the object about an axis $\Omega = [\omega_X\; \omega_Y\; \omega_Z]^T$. $\Omega$
describes the orientation of the axis of rotation, and $|\Omega|$ is the magnitude of
rotation per unit time. For small rotations, with $\mathbf{X} = [X\; Y\; Z]^T$ denoting the position of the point,

$$V = T + \Omega \times \mathbf{X} = T - \mathbf{X} \times \Omega \tag{7}$$

The cross-product of two vectors may be rewritten as the product of a skew-symmetric
matrix and a vector. Applying this to the cross product $\mathbf{X} \times \Omega$ results in

$$\mathbf{X} \times \Omega = \hat{X}\,\Omega, \quad \text{where } \hat{X} = \begin{bmatrix} 0 & -Z & Y \\ Z & 0 & -X \\ -Y & X & 0 \end{bmatrix}$$

Equation 7 can be rearranged into the convenient matrix form

$$V = Q\phi \tag{8}$$

by defining the motion parameter vector $\phi$ as $[T^T\; \Omega^T]^T$, and defining the matrix

$$Q = \begin{bmatrix} I & -\hat{X} \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & Z & -Y \\ 0 & 1 & 0 & -Z & 0 & X \\ 0 & 0 & 1 & Y & -X & 0 \end{bmatrix} \tag{9}$$
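As an illustration of Equations 7 through 9, the following minimal sketch builds the Q matrix for one point and confirms that the product of Q and the motion parameter vector equals the translation plus the cross product of the rotation vector with the point. The point, translation, and rotation values are assumed for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only: names and numeric values are assumptions.

def cross(a, b):
    """Cross product a x b of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def q_matrix(X):
    """Q = [I  -X_hat] of Equation 9 for a point X = (X, Y, Z)."""
    Xc, Yc, Zc = X
    return [[1.0, 0.0, 0.0, 0.0, Zc, -Yc],
            [0.0, 1.0, 0.0, -Zc, 0.0, Xc],
            [0.0, 0.0, 1.0, Yc, -Xc, 0.0]]

def velocity(X, phi):
    """Equation 8: V = Q phi, with phi = [T, Omega] as a 6-vector."""
    return [sum(q * p for q, p in zip(row, phi)) for row in q_matrix(X)]

X = [0.3, -0.2, 2.5]              # assumed 3-D point
T = [0.01, 0.02, -0.03]           # assumed translation per frame
Omega = [0.004, -0.002, 0.006]    # assumed small rotation per frame
phi = T + Omega                   # list concatenation gives the 6-vector

V_matrix = velocity(X, phi)
V_direct = [t + c for t, c in zip(T, cross(Omega, X))]   # Equation 7
```

Because each row of Q depends only on the point's coordinates, the same six parameters serve every pixel on a rigid body, which is what allows the per-pixel constraints to be stacked later.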



The Q matrix may also be written in terms of image x and y coordinates,
instead of 3-D world X and Y coordinates, in order to be consistent with the
constraint equation as derived thus far:

$$Q = \begin{bmatrix} 1 & 0 & 0 & 0 & Z & -\frac{Zy}{f} \\ 0 & 1 & 0 & -Z & 0 & \frac{Zx}{f} \\ 0 & 0 & 1 & \frac{Zy}{f} & -\frac{Zx}{f} & 0 \end{bmatrix} \tag{10}$$

Using either form of Q, substitution of the right side of Equation 8 for V in
Equation 6 produces a single linear equation relating image intensity derivatives to
rigid body motion parameters under perspective projection at a single pixel:

$$-B_t = \frac{1}{Z}\begin{bmatrix} fB_x & fB_y & -(xB_x + yB_y) \end{bmatrix} Q\phi \tag{11}$$

Since video-rate depth information is available via the range processor 16,
in accordance with the present invention changes in the depth image over time can
be related to rigid body motion in a manner similar to that shown for intensity
information above, and employed in the image processor 14 to estimate pose. For
rigid bodies, an object point which appears at a particular image location (x, y) at
time t will appear at location $(x + v_x,\; y + v_y)$ at time t+1. The depth values at
these corresponding locations in image space and time should therefore be the
same, except for any depth translation the object point undergoes between the two
frames. This can be expressed in a form similar to Equation 1:

$$Z(x,y,t) + V_Z(x,y,t) = Z(x + v_x(x,y,t),\; y + v_y(x,y,t),\; t+1) \tag{12}$$

The same series of steps described above for deriving the brightness constraint on
rigid body motion can now be used to derive an analogous linear depth change
constraint equation on rigid body motion. First-order Taylor series expansion,
followed by rearrangement into matrix form, produces

$$-Z_t = \begin{bmatrix} Z_x & Z_y \end{bmatrix}\begin{bmatrix} v_x \\ v_y \end{bmatrix} - V_Z \tag{13}$$

Use of perspective camera projection to relate image velocities to 3-D
world velocities yields

$$-Z_t = \frac{1}{Z}\begin{bmatrix} fZ_x & fZ_y & -(Z + xZ_x + yZ_y) \end{bmatrix} V \tag{14}$$

Finally, 3-D world velocities are constrained to rigid body motion by
introducing the Q matrix

$$-Z_t = \frac{1}{Z}\begin{bmatrix} fZ_x & fZ_y & -(Z + xZ_x + yZ_y) \end{bmatrix} Q\phi \tag{15}$$

This linear equation for relating depth gradient measurements to rigid body motion
parameters at a single pixel is the depth analog to Equation 11.

In many applications, it is possible to approximate the camera projection
model as orthographic instead of perspective without introducing significant error
in 3-D world coordinate estimation. For pose tracking algorithms, use of
orthographic projection simplifies the constraint equations derived previously,
making the solution of linear systems of these equations much less computationally
intensive.

To derive the orthographic analogs of Equations 11 and 15, the perspective
projection relationship is replaced with the orthographic projection equations x = X
and y = Y, which in turn imply that $v_x = V_X$ and $v_y = V_Y$. Hence, Equation 5 is
replaced by the simpler equation

$$\begin{bmatrix} v_x \\ v_y \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} V \tag{16}$$

Proceeding through the remainder of the derivation of Equation 11 yields its
orthographic projection analog:

$$-B_t = \begin{bmatrix} B_x & B_y & 0 \end{bmatrix} Q\phi \tag{17}$$

Similar modification to the derivation of the depth change constraint produces the
orthographic counterpart to Equation 15:

$$-Z_t = \begin{bmatrix} Z_x & Z_y & -1 \end{bmatrix} Q\phi \tag{18}$$
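The orthographic depth constraint of Equation 18 can be illustrated on a simple synthetic case. The sketch below is a minimal example under stated assumptions: the observed surface is a plane, the motion is a pure translation, and all names and numeric values are hypothetical. For such a surface the first-order constraint holds exactly, so the measured temporal depth change matches the prediction of the constraint row.

```python
# Illustrative sketch only: names and numeric values are assumptions.

def depth_row_orthographic(Zx, Zy, X, Y, Z):
    """Row [Zx, Zy, -1] Q of Equation 18, with Q = [I  -X_hat]."""
    g = [Zx, Zy, -1.0]
    Q = [[1.0, 0.0, 0.0, 0.0, Z, -Y],
         [0.0, 1.0, 0.0, -Z, 0.0, X],
         [0.0, 0.0, 1.0, Y, -X, 0.0]]
    return [sum(g[r] * Q[r][c] for r in range(3)) for c in range(6)]

# Assumed planar depth map Z(x, y) = a*x + b*y + c at time t.
a, b, c = 0.3, -0.2, 4.0
T = (0.05, -0.02, 0.1)            # assumed translation between frames

def depth_before(x, y):
    return a * x + b * y + c

def depth_after(x, y):
    # Under orthographic projection the plane shifts by (T_X, T_Y) in the
    # image and every depth value increases by T_Z.
    return a * (x - T[0]) + b * (y - T[1]) + c + T[2]

x, y = 1.0, 2.0                   # pixel location (equal to X, Y here)
Z = depth_before(x, y)
Z_t = depth_after(x, y) - Z       # temporal depth derivative at the pixel
phi = list(T) + [0.0, 0.0, 0.0]   # pure translation, no rotation

row = depth_row_orthographic(a, b, x, y, Z)
predicted = sum(r * p for r, p in zip(row, phi))   # should equal -Z_t
```

For curved surfaces or motions that include rotation, the agreement is only first-order, which is consistent with the Taylor series assumption used in the derivation.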
At each time step, registered intensity and depth images are obtained.
Intensity and depth gradients are computed with respect to image coordinates x and
y, and with respect to time, at each pixel location. Intensity and depth constraint
equations of the form of (11) and (15) can be written for each pixel location.
Because the intensity constraint equations (11) are linear, they can be combined
across N pixels by stacking the equations in matrix form: $H_I \phi_I = b_I$.
$H_I \in \mathbb{R}^{N \times 6}$, where each row is the vector obtained by multiplying out the
right side of Equation 11 at a single pixel i. $b_I \in \mathbb{R}^{N \times 1}$, where the ith element is
$-B_t$ at pixel i. The I subscripts on the variables indicate that they reflect only the
use of intensity constraints. Provided that N > 6, this system of linear equations
will over-constrain the motion parameters $\phi_I$ so that they can be solved by the
least-squares method:

$$\phi_I = -(H_I^T H_I)^{-1} H_I^T b_I \tag{19}$$

The linear depth constraint equations can be combined similarly across
pixels to form the linear system $H_D \phi_D = b_D$, where $H_D \in \mathbb{R}^{N \times 6}$, $b_D \in \mathbb{R}^{N \times 1}$, and
the elements of $H_D$ and $b_D$ are derived from Equation 15 in a manner analogous to
that explained for the intensity linear system above. The D subscripts on the
variables indicate that they reflect only the use of depth constraints. Provided that
N > 6, the motion parameters $\phi_D$ can be solved according to the least-squares
method, as in Equation 19.

The intensity and depth linear systems can be combined into a single linear
system for constraining the motion parameters $\phi$: $H\phi = b$, where
$H = [H_I^T \;\; \lambda H_D^T]^T$, and $b = [b_I^T \;\; \lambda b_D^T]^T$.

The scaling factor, $\lambda$, provides control over the weighting of depth
constraints relative to intensity constraints. In cases where depth can be expected
to be more reliable than intensity, such as under fast-changing lighting conditions,
it is preferable to set $\lambda$ to a value higher than 1, but under other conditions, such
as when depth information is much noisier than intensity, lower $\lambda$ values are
preferably used. The least-squares solution to the above equation is

$$\phi = -(H^T H)^{-1} H^T b \tag{20}$$
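The stacking of intensity and depth constraints can be sketched as follows. This is a minimal example under stated assumptions: it uses the orthographic row forms of Equations 17 and 18, synthetic random gradients and point positions, noise-free right-hand sides, and a hand-written normal-equation solver; all names are hypothetical. In this sketch b is formed directly as each constraint row times the true parameters, so the estimate is the ordinary normal-equation solution of the stacked system; the sign of b in a real system depends on how it is populated from the measured derivatives.

```python
import random

def constraint_row(g, X):
    """Row g^T [I  -X_hat] for a 3-vector g and a point X = (X, Y, Z)."""
    Xc, Yc, Zc = X
    Q = [[1.0, 0.0, 0.0, 0.0, Zc, -Yc],
         [0.0, 1.0, 0.0, -Zc, 0.0, Xc],
         [0.0, 0.0, 1.0, Yc, -Xc, 0.0]]
    return [sum(g[r] * Q[r][c] for r in range(3)) for c in range(6)]

def solve_linear(A, b):
    """Solve the square system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def least_squares(H, b):
    """Least-squares solution of H phi = b via the normal equations."""
    n = len(H[0])
    HtH = [[sum(r[i] * r[j] for r in H) for j in range(n)] for i in range(n)]
    Htb = [sum(r[i] * bi for r, bi in zip(H, b)) for i in range(n)]
    return solve_linear(HtH, Htb)

random.seed(0)
N = 40                                             # assumed pixel count
phi_true = [0.02, -0.01, 0.03, 0.004, -0.006, 0.002]
H_I, H_D = [], []
for _ in range(N):
    X = (random.uniform(-1, 1), random.uniform(-1, 1), random.uniform(1, 4))
    Bx, By = random.uniform(-1, 1), random.uniform(-1, 1)
    Zx, Zy = random.uniform(-1, 1), random.uniform(-1, 1)
    H_I.append(constraint_row([Bx, By, 0.0], X))   # intensity row
    H_D.append(constraint_row([Zx, Zy, -1.0], X))  # depth row
b_I = [sum(h * p for h, p in zip(r, phi_true)) for r in H_I]
b_D = [sum(h * p for h, p in zip(r, phi_true)) for r in H_D]

lam = 2.0                                          # assumed depth weighting
H = H_I + [[lam * h for h in r] for r in H_D]      # stack intensity over depth
b = b_I + [lam * v for v in b_D]
phi_est = least_squares(H, b)
```

With noise-free data the recovered parameters match the true ones for any positive weighting; the weighting only matters once the two constraint sets carry different amounts of noise.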

To estimate the motion of an object, it is preferable to combine the constraint
equations only across pixel locations which correspond to the object, and for
which intensity, depth, and their derivatives are well-defined. To do this, a
support map $w(x,y) \in [0,1]$ is used which indicates the probability that each pixel
corresponds to the object of interest and that its measurements and derivatives are
reliable. The least-squares solution is modified to weight the contribution of the
constraint from pixel location (x,y) according to w(x,y):

$$\phi = -(H^T W^T W H)^{-1} H^T W^T W b \tag{21}$$

$W \in \mathbb{R}^{N \times N}$ is a diagonal matrix whose entry $W(i,i) = w(x_i, y_i)$. If a binary
support map is used, i.e., all values of w(x,y) are either 0 or 1, the W matrix can
be omitted, and the rows of H and b corresponding to pixels i for
which $w(x_i, y_i) = 0$ are removed. A support map may also be applied in similar fashion when
solving for motion parameters using depth or intensity constraints only. The three
support maps for these different constraint combinations do not need to be the
same.
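The effect of the support map can be illustrated with a deliberately small example. The sketch below is not the six-parameter system described above: it assumes only two unknowns so that the weighted normal equations can be solved in closed form, and the rows, right-hand sides, and weights are made up for illustration. It shows that a binary support map gives the same answer as deleting the unsupported rows.

```python
# Illustrative sketch only: a two-unknown stand-in for the six-parameter system.

def weighted_solve_2(H, b, w):
    """Minimize sum_i (w_i * (H_i . phi - b_i))^2 for two unknowns."""
    a11 = a12 = a22 = c1 = c2 = 0.0
    for (h1, h2), bi, wi in zip(H, b, w):
        ww = wi * wi                 # W^T W contributes w_i squared
        a11 += ww * h1 * h1
        a12 += ww * h1 * h2
        a22 += ww * h2 * h2
        c1 += ww * h1 * bi
        c2 += ww * h2 * bi
    det = a11 * a22 - a12 * a12
    # Cramer's rule on the 2x2 normal equations.
    return [(c1 * a22 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det]

phi_true = [0.5, -0.25]
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
# The first three right-hand sides are consistent with phi_true;
# the last row is an outlier standing in for an off-object pixel.
b = [0.5, -0.25, 0.25, 99.0]

phi_masked = weighted_solve_2(H, b, [1.0, 1.0, 1.0, 0.0])    # binary map
phi_dropped = weighted_solve_2(H[:3], b[:3], [1.0, 1.0, 1.0])  # rows removed
phi_unweighted = weighted_solve_2(H, b, [1.0, 1.0, 1.0, 1.0])  # outlier kept
```

Fractional weights between 0 and 1 reduce, rather than eliminate, the influence of a pixel, which suits support maps that encode a probability instead of a hard mask.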

The motions estimated between individual pairs of frames are added
together to form an estimate of cumulative object motion over time. It may be
beneficial to supplement this tracking technique with another algorithm for
determining when motion estimation has accumulated substantial error, and to
reinitialize the object shape estimate at these times.






In a preferred embodiment of the invention, velocities of the visible points
on the figure can be translated into rotations and translations of the figure and its
limbs relative to the world coordinates, using twist mathematics. The brightness
and depth constraint equations, Equations 3 and 13, can be combined as follows:

$$-\begin{bmatrix} B_t(x,y,t) \\ Z_t(x,y,t) \end{bmatrix} = \begin{bmatrix} B_x(x,y,t) & B_y(x,y,t) & 0 \\ Z_x(x,y,t) & Z_y(x,y,t) & -1 \end{bmatrix}\begin{bmatrix} v_x \\ v_y \\ v_Z \end{bmatrix} \tag{22}$$

This formulation gives 2N equations in terms of 3N unknown motion
vectors ($v_x$, $v_y$, and $v_Z$ for each point) where N is the number of visible points on an
articulated figure. Two of the vectors are in the image plane and the third is in the
real world. In order to translate image-plane velocities into world-coordinate
velocities, a camera model is employed. Using a perspective camera model and
assuming that the origin of the world coordinate system is at the camera and the
z-axis is along the viewing axis of the camera, so that $x = fX/Z$ and $y = fY/Z$, results
in:

$$\begin{bmatrix} v_x \\ v_y \\ v_Z \end{bmatrix} = \begin{bmatrix} \frac{f}{Z} & 0 & -\frac{x}{Z} \\ 0 & \frac{f}{Z} & -\frac{y}{Z} \\ 0 & 0 & 1 \end{bmatrix}\begin{bmatrix} v_X \\ v_Y \\ v_Z \end{bmatrix} \tag{23}$$

This constraint simply changes the unknown parameters. The total is still 2N
equations in terms of 3N unknowns ($v_X$, $v_Y$, and $v_Z$ for each point).
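A twist $\xi = \begin{bmatrix} v \\ \omega \end{bmatrix}$ is a 6-element vector with the first 3 elements, v,
indirectly representing the translation and the last three elements, $\omega$, representing
the axis (and sometimes the amount) of rotation. As a matter of convention, if the
twist is used with an explicit scaling term $\theta$, then $|\omega| = 1$; otherwise, the
magnitude of $\omega$ is set according to the amount of rotation. The twist can be used
to form a 4x4 matrix, through the operation of the "hat operator":

$$\hat{\xi} = \begin{bmatrix} \hat{\omega} & v \\ 0 & 0 \end{bmatrix}, \quad \text{where } \hat{\omega} = \begin{bmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{bmatrix}$$

When exponentiated, this 4x4 matrix gives the rotation/translation matrix

$$e^{\hat{\xi}\theta} = \begin{bmatrix} e^{\hat{\omega}\theta} & p \\ 0 & 1 \end{bmatrix}$$

where $e^{\hat{\omega}\theta}$ is the rotation and $p = \left((I - e^{\hat{\omega}\theta})\hat{\omega} + \omega\omega^T\theta\right)v$ is the
translation, which maps x, y and z to v.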
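The hat operator and the exponential of a twist can be sketched in a few lines. The following is a minimal illustration, assuming a unit-length rotation axis so that the rotation block can be evaluated with the Rodrigues formula (a standard closed form that is named here as an assumption, not quoted from the description above); the function names and the example twist are hypothetical. The example twist describes rotation about an axis parallel to z that passes through the point (1, 0, 0), so that point should stay fixed.

```python
import math

def hat3(w):
    """3x3 skew-symmetric matrix w_hat, so that w_hat applied to x is w x x."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]

def twist_hat(v, w):
    """4x4 matrix xi_hat = [[w_hat, v], [0, 0]]."""
    W = hat3(w)
    return [W[0] + [v[0]], W[1] + [v[1]], W[2] + [v[2]], [0.0, 0.0, 0.0, 0.0]]

def matmul3(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec3(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

def twist_exp(v, w, theta):
    """Rotation R and translation p of exp(xi_hat * theta), assuming |w| = 1."""
    W = hat3(w)
    W2 = matmul3(W, W)
    s, c = math.sin(theta), math.cos(theta)
    eye = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    # Rodrigues formula: R = I + sin(theta) w_hat + (1 - cos(theta)) w_hat^2
    R = [[eye[i][j] + s * W[i][j] + (1.0 - c) * W2[i][j] for j in range(3)]
         for i in range(3)]
    # p = ((I - R) w_hat + w w^T theta) v
    I_minus_R = [[eye[i][j] - R[i][j] for j in range(3)] for i in range(3)]
    A = matmul3(I_minus_R, W)
    M = [[A[i][j] + w[i] * w[j] * theta for j in range(3)] for i in range(3)]
    return R, matvec3(M, v)

def apply_rigid(R, p, q):
    """Apply the rotation/translation [R p; 0 1] to a 3-D point q."""
    Rq = matvec3(R, q)
    return [Rq[i] + p[i] for i in range(3)]

w = [0.0, 0.0, 1.0]          # unit rotation axis (assumed)
q_axis = [1.0, 0.0, 0.0]     # a point on the rotation axis (assumed)
v = [0.0, -1.0, 0.0]         # v = -(w x q_axis) for rotation about that axis
R, p = twist_exp(v, w, math.pi / 2)
fixed = apply_rigid(R, p, q_axis)
moved = apply_rigid(R, p, [2.0, 0.0, 0.0])
xi_hat = twist_hat(v, w)
```

Points on the axis are left in place while other points swing around it, which is why a single angle per joint is enough to describe a revolute joint once its twist is known.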

Using twists, the world coordinates ($q_s$) of any point on the body can be
expressed as a function of time, of the point's limb number (k), of the pose
parameters ($\xi_0$ and $\theta$), and of the point's limb-centric coordinates ($q_b$):

$$q_s(t) = g_{sb}(\xi_0(t), \theta(t) \mid k, \xi_1, \ldots, \xi_{K-1})\, q_b$$

where K is the number of articulated limbs in the figure. The mapping from
limb-centric coordinates to world coordinates is done by translation/rotation as dictated
by a "reference configuration" for the k-th limb, $g_{sk}(0)$; by the translations/rotations
$e^{\hat{\xi}_1\theta_1} \cdots e^{\hat{\xi}_{k-1}\theta_{k-1}}$ introduced by the joints along the articulated chain up to the k-th
limb; and by the translations/rotations $e^{\hat{\xi}_0}$ from the camera to the figure's torso.
The parameters $\xi_1$ to $\xi_{K-1}$ define the position of each joint on the figure relative to
its reference joint, e.g. the elbow relative to the shoulder, the wrist relative to the
elbow, etc.

Each limb's reference configuration gives the translation and rotation from
that limb's coordinate system to the world coordinate system, when the body is
positioned at $\xi_0 = 0$ and when all of the joint angles are zero. The extra degrees
of freedom given by the reference configuration simplify the task of describing
the geometry of the articulated joint locations. Given a specific pose, the
transformation from the limb's coordinate frame to the world coordinate frame is:

$$g_{sb}(\xi_0(t), \theta(t) \mid k, \xi_1, \ldots, \xi_{K-1}) = e^{\hat{\xi}_0}\, e^{\hat{\xi}_1\theta_1} \cdots e^{\hat{\xi}_{k-1}\theta_{k-1}}\, g_{sk}(0)$$

For notational simplicity hereinafter, $g_{sb}(\xi_0(t), \theta(t) \mid k, \xi_1, \ldots, \xi_{K-1})$ will be identified
as $g_{sb}$.

Using this description of the world coordinates of each body point in terms
of the articulated-pose parameters, the world velocities can be related to the
rotations and translations of the K coordinate frames that are tied to the K limbs of
the figure. Since $q_s(t) = g_{sb}\, q_b$, and $q_b$ is independent of time:

$$\begin{aligned} \frac{d}{dt} q_s(t) &= \dot{g}_{sb}\, q_b \\ &= \dot{g}_{sb}\, g_{sb}^{-1}\, q_s(t) \\ &= \hat{V}^s_{sb}\, q_s(t) \end{aligned} \tag{24}$$

The second line of the above identity is derived from the inverse of the identity
$q_s(t) = g_{sb}\, q_b$. The third line is by definition: $\hat{V}^s_{sb} = \dot{g}_{sb}\, g_{sb}^{-1}$ is a 4x4 matrix
describing the motion of the k-th limb's coordinate frame relative to the world
coordinate frame, in terms of world coordinates. Using $V^s_{sb}$, a 6x1 vector, to
describe this coordinate transformation makes use of the special structure of $\dot{g}_{sb}\, g_{sb}^{-1}$.
Specifically, the first three rows and columns of $\dot{g}_{sb}\, g_{sb}^{-1}$ are skew symmetric and
the bottom row is all zeros. $q_s(t) = [q_X\; q_Y\; q_Z\; 1]^T$ is the homogeneous world
coordinates of the body point at time t. More generally, $q_X$, $q_Y$, and $q_Z$ are the
coordinates of the point within a coordinate system that is tied to the world
coordinate system by some known translation and rotation. Reformulating the
above equation,

$$\begin{bmatrix} v_X \\ v_Y \\ v_Z \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & 0 & q_Z & -q_Y \\ 0 & 1 & 0 & -q_Z & 0 & q_X \\ 0 & 0 & 1 & q_Y & -q_X & 0 \end{bmatrix} V^s_{sb} \tag{25}$$



Hence, the X-, Y- and Z-axis motion of each point on the body is obtained
by multiplying the 6 motion parameters for each limb by the matrix. At this point,
each limb has been forced to be internally rigid. This reduces the number of
unknown parameters down to 6 parameters per limb (the 6 elements of $V^s_{sb}$ for
that limb). There are now 2N equations in terms of 6K unknowns, where K is the
number of articulated limbs.

The solution is further constrained by taking advantage of the
interconnections between limbs. To do this, $V^s_{sb}$ for the k-th limb is described in
terms of the articulated-pose parameters:

$$V^s_{sb} = V^s_{s0} + \sum_{i=1}^{k-1} J^s_{i-1,i}(\theta_i)\, \dot{\theta}_i \tag{26}$$

where $V^s_{s0}$ is the velocity due to the motion of the body relative to the world
coordinates and the Jacobian value $J^s_{i-1,i}(\theta_i)\, \dot{\theta}_i$ is the velocity due to the motion
of the i-th joint along the articulated chain to the k-th limb. Using the identity
$\hat{V} = \dot{g}\, g^{-1}$ provides:

d 



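Thus, six parameters are employed for the torso, but all other limbs are
reduced to one degree of freedom $\theta$ for each limb. To simplify this further, the
adjoint of a rotation/translation matrix,
$g = \begin{bmatrix} R & p \\ 0 & 1 \end{bmatrix}$, is defined as

\[
\mathrm{Adj}(g) = \begin{bmatrix} R & \hat{p} R \\ 0 & R \end{bmatrix}
\]

Using this definition, $g\, \hat{\xi}\, g^{-1} = (\mathrm{Adj}(g)\, \xi)^{\wedge}$,
where $(\;)^{\wedge}$ means
that the hat operator is applied to the vector contained within the parentheses.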
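
The adjoint identity can be checked numerically. The following is a minimal
sketch with hypothetical helper names, assuming twists are ordered as
(v, w) with the translational part first; the rotation, translation and twist
values are arbitrary test inputs.

```python
import math


def hat3(w):
    """Skew-symmetric 3x3 matrix of a 3-vector."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def twist_hat(xi):
    """4x4 hat of a twist ordered as (v, w)."""
    v, w = xi[:3], xi[3:]
    top = [row + [v[i]] for i, row in enumerate(hat3(w))]
    return top + [[0.0, 0.0, 0.0, 0.0]]


def rigid(R, p):
    """4x4 homogeneous transformation [R p; 0 1]."""
    return [list(R[i]) + [p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def adjoint(R, p):
    """6x6 adjoint [R, hat(p) R; 0, R]."""
    pR = matmul(hat3(p), R)
    top = [list(R[i]) + pR[i] for i in range(3)]
    bottom = [[0.0, 0.0, 0.0] + list(R[i]) for i in range(3)]
    return top + bottom


c, s = math.cos(0.7), math.sin(0.7)
R = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]   # rotation about the Z axis
p = [0.3, -0.2, 0.5]
xi = [0.1, 0.2, 0.3, 0.4, -0.5, 0.6]

Rt = transpose(R)
g = rigid(R, p)
g_inv = rigid(Rt, [-sum(Rt[i][k] * p[k] for k in range(3)) for i in range(3)])

lhs = matmul(matmul(g, twist_hat(xi)), g_inv)          # g * hat(xi) * g^-1
adj_xi = [row[0] for row in matmul(adjoint(R, p), [[x] for x in xi])]
rhs = twist_hat(adj_xi)                                # hat(Adj(g) * xi)
max_error = max(abs(lhs[i][j] - rhs[i][j]) for i in range(4) for j in range(4))
```

The two sides agree to rounding error, which is what allows a 4x4 matrix
conjugation to be replaced by a single 6x6 matrix-vector product.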
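Applying this adjoint identity,

\[
J^s_{i-1,i}(\theta_i)
  = \mathrm{Adj}\bigl(e^{\hat{\xi}_0} e^{\hat{\xi}_1 \theta_1} \cdots
    e^{\hat{\xi}_{i-1} \theta_{i-1}}\bigr)\, \xi_i
\tag{27}
\]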



The velocity due to the motion of the figure relative to the world
coordinates should allow for unconstrained rotations and translations. An easy
way to do this is to express these motions in terms of the 4x4 transformation
matrix, instead of in terms of the twist coordinates. Let
$e^{\hat{\xi}_0} = \begin{bmatrix} R_0 & p_0 \\ 0 & 1 \end{bmatrix}$. Then:

\[
\hat{V}^s_{s0} = \left( \frac{d}{dt} e^{\hat{\xi}_0} \right) e^{-\hat{\xi}_0}
  = \begin{bmatrix} \dot{R}_0 R_0^T & -\dot{R}_0 R_0^T p_0 + \dot{p}_0 \\ 0 & 0 \end{bmatrix}
\]

This constraint is linear in terms of the unknowns ($\dot{R}_0$ and $\dot{p}_0$) but it is not
tightly constrained. $\dot{R}_0$ has 9 unknowns, instead of the 3 unknowns that can be
used to describe a rotation or its derivative. This over-parameterization is
corrected by noting that the first 3 rows/columns of $\hat{V}^s_{s0}$ must be skew symmetric.
To capture this structure, the rotational component of the frame velocity is defined
as $\omega_0 = (\dot{R}_0 R_0^T)^{\vee}$, where $(\;)^{\vee}$ is the inverse of the hat
operator on the skew-symmetric matrix contained within the parentheses.

It is significant to note that this is not a "small angle approximation" such
as is often used for mapping a rotation matrix down onto its rotation axis. The
identity $\hat{\omega}_0 = \dot{R}_0 R_0^T$ is exact. The difference is that there is a
special structure embedded in the derivative of a rotation matrix. The only way that an
orthonormal matrix can transform into another orthonormal matrix is structured so
that the derivative matrix times the transpose of the orthonormal matrix is a
skew-symmetric matrix.

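Substituting and rearranging, the velocity of the torso is mapped back to
the twists, with six degrees of freedom, as follows:

\[
V^s_{s0} = \begin{bmatrix} I & \hat{p}_0 \\ 0 & I \end{bmatrix}
           \begin{bmatrix} \dot{p}_0 \\ \omega_0 \end{bmatrix}
\tag{28}
\]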
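
A small sketch of this mapping follows. It assumes the block form
[[I, hat(p0)], [0, I]] applied to the stacked vector (p0_dot, omega0); the
function name and the numeric values are hypothetical. It also shows that the
off-diagonal block vanishes when p0 is zero.

```python
def body_twist(p0, p0_dot, omega0):
    """Return [p0_dot + hat(p0) omega0, omega0] as a 6-element list."""
    px, py, pz = p0
    wx, wy, wz = omega0
    # hat(p0) applied to omega0 is the cross product p0 x omega0.
    coupling = [py * wz - pz * wy, pz * wx - px * wz, px * wy - py * wx]
    linear = [d + c for d, c in zip(p0_dot, coupling)]
    return linear + [wx, wy, wz]


# Same motion parameters, two different body positions.
offset_twist = body_twist([0.5, 0.0, 2.0], [0.01, 0.0, 0.0], [0.0, 0.1, 0.0])
centered_twist = body_twist([0.0, 0.0, 0.0], [0.01, 0.0, 0.0], [0.0, 0.1, 0.0])
```

With a non-zero p0 the translational part of the twist depends on the rotation
rate; with p0 equal to zero it reduces to p0_dot alone.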



Once an estimate is obtained for $\omega_0$ (based on 2N linear constraint
equations), it is translated into an estimate for $\dot{R}_0$ by rearranging the defining
equation:

\[
\dot{R}_0 = \hat{\omega}_0 R_0^{-T} = \hat{\omega}_0 R_0
\tag{29}
\]

There are now 2N linear constraint equations in terms of K + 5 unknowns
($\omega_0$, $\dot{p}_0$, $\dot{\theta}_1$ through $\dot{\theta}_{K-1}$) plus one auxiliary
equation to remap $\omega_0$ back into $\dot{R}_0$. Thus, a modified parameter set is
employed to maintain linearity during the computations, and then mapped back to the
original set non-linearly.

Discrete-time approximations to the time derivatives, $\dot{R}_0$, $\dot{p}_0$, and
$\dot{\theta}_1$ through $\dot{\theta}_{K-1}$, are then determined. A forward-difference
approximation to the body-translation and joint-angle derivatives is used:

\[
\dot{p}_0(t) \approx p_0(t+1) - p_0(t), \qquad
\dot{\theta}_i(t) \approx \theta_i(t+1) - \theta_i(t)
\tag{30}
\]

A forward-difference approximation to the body-rotation derivative is
not desirable since using this approximation destroys the orthonormal structure of
the rotation matrix $R_0$. Instead, a central-difference approximation is preferred:

\[
\dot{R}_0\bigl(t + \tfrac{1}{2}\bigr) \approx R_0(t+1) - R_0(t)
\]

Using this central difference approximation along with Equation 29 and a
first-order linear interpolation for the half-sample delay, produces:

\[
R_0(t+1) - R_0(t) = \hat{\omega}_0\, \frac{R_0(t+1) + R_0(t)}{2}
\]

so that

\[
R_0(t+1) = \bigl(I - \tfrac{1}{2}\hat{\omega}_0\bigr)^{-1}
           \bigl(I + \tfrac{1}{2}\hat{\omega}_0\bigr)\, R_0(t)
\tag{31}
\]

Combining Equations 22, 23, 25, 26, 27, 28 and 30 again results in 2N
linear constraint equations in terms of K + 5 unknowns. The difference is that the
unknowns are now the updated parameters $\omega_0$, $p_0(t+1)$, $\theta_1(t+1)$ through
$\theta_{K-1}(t+1)$. These constraints are solved using least squares. Once that
solution is obtained, Equation 31 provides the non-linear mapping from $\omega_0$ to
$R_0(t+1)$.
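
The effect of the central-difference update on orthonormality can be illustrated
with a short sketch. It assumes the rotation update takes the form
R(t+1) = (I - hat(w)/2)^-1 (I + hat(w)/2) R(t); all function names and the value
of omega are hypothetical.

```python
def hat3(w):
    """Skew-symmetric 3x3 matrix of a 3-vector."""
    return [[0.0, -w[2], w[1]], [w[2], 0.0, -w[0]], [-w[1], w[0], 0.0]]


def mat3_mul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def mat3_inv(m):
    """Inverse of a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]


def rotation_update(R, omega):
    """Central-difference update of the rotation matrix R by omega."""
    W = hat3(omega)
    eye = [[1.0 if r == c else 0.0 for c in range(3)] for r in range(3)]
    minus = [[eye[r][c] - 0.5 * W[r][c] for c in range(3)] for r in range(3)]
    plus = [[eye[r][c] + 0.5 * W[r][c] for c in range(3)] for r in range(3)]
    return mat3_mul(mat3_mul(mat3_inv(minus), plus), R)


def orthonormality_error(R):
    """Largest absolute entry of R R^T - I."""
    Rt = [list(col) for col in zip(*R)]
    P = mat3_mul(R, Rt)
    return max(abs(P[r][c] - (1.0 if r == c else 0.0))
               for r in range(3) for c in range(3))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
omega = [0.3, -0.2, 0.5]
central = rotation_update(identity, omega)
W = hat3(omega)
# Forward difference for comparison: R(t+1) = R(t) + hat(omega) R(t), with R(t) = I.
forward = [[identity[r][c] + W[r][c] for c in range(3)] for r in range(3)]
```

The central-difference result stays orthonormal to rounding error, whereas the
forward-difference result does not.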

The quality of the above constraints depends on the accuracy of the
first-order Taylor-series expansion used in the brightness and depth constraints. This
first-order approximation often fails on large motions. Conventionally, this type
of failure is compensated by estimating the motion, warping the images according
to that motion estimate and repeating. This iterative estimation approach has
several drawbacks. It is computationally expensive, requiring sample interpolation
of the image being warped, re-computation of its spatial derivatives, and multiple
formulations and solutions of the constraint equations at each time step. It
introduces interpolation errors, both in the warped image values and in the spatial
derivatives. Finally, for large motions, the initial motion estimates may actually
point away from the true solution.

The accuracy of the constraints can be improved without iteration, without
interpolation, and without recomputing spatial derivatives, by allowing the focus
of expansion (FOE) of each point to shift "independently" by some integer
amount, $(S_x, S_y)$. Shifting the FOE by $(S_x, S_y)$, the constraints become:

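\[
-\begin{bmatrix}
B(x - S_x, y - S_y, t+1) - B(x, y, t) \\
Z(x - S_x, y - S_y, t+1) - Z(x, y, t)
\end{bmatrix}
=
\begin{bmatrix}
B_x(x,y,t) & B_y(x,y,t) \\
Z_x(x,y,t) & Z_y(x,y,t)
\end{bmatrix}
\begin{bmatrix} S_x \\ S_y \end{bmatrix}
+
\begin{bmatrix}
B_x(x,y,t) & B_y(x,y,t) & 0 \\
Z_x(x,y,t) & Z_y(x,y,t) & -1
\end{bmatrix}
\begin{bmatrix} v_x \\ v_y \\ v_Z \end{bmatrix}
\tag{32}
\]

where $(v_x, v_y)$ is the image-plane velocity of the point and $v_Z$ is its velocity
in depth.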

Equation 32 treats each constraint equation as if the $(t+1)$-th frame translated rigidly
by $(S_x, S_y)$. As long as $(S_x, S_y)$ is integer valued, it is not necessary to interpolate
the image. Since the equations assume rigid translation, there is no need to
recompute the spatial derivatives of the $t$-th frame (as would be required to warp the
$(t+1)$-th frame). Even though each individual constraint derived from Equation 32
acts as though the frame was rigidly translated, the set of constraints across the
visible image does not have to share $(S_x, S_y)$ values. Instead, at each pixel, a new
$(S_x, S_y)$ can be selected according to the expected shift for that pixel.



This provides the freedom to choose a distinct value $(S_x, S_y)$ for each visible
point on the figure. This could be done within the traditional iterative motion
estimation framework, by estimating the motion without shifting first to get an
initial hypothesis for the image motion. Using that approach, the offset dictated
for each pixel by the unshifted motion estimates is rounded to the nearest integer
values, and those values are used for $(S_x, S_y)$ in Equation 32. The motion is then
re-estimated according to the constraint equations generated by Equation 32. This
process could be iterated as often as needed to converge to a stable motion
estimate.

It is preferable, however, to use the cross correlations between times
t and t + 1 of the brightness and depth images. This cross correlation is evaluated
for each limb separately, allowing a small number of 2D (image plane) translations
and rotations. To avoid the overhead of image interpolation, zero-order hold can
be used to provide non-zero rotations of the limbs. A nominal translation/rotation
is selected for each limb, based on the peak of this cross correlation.

Having selected a nominal translation/rotation of the FOEs for a limb, for
each point on the limb the integer-valued offset nearest to the offset dictated by
the selected translation/rotation is used for $S_x$ and $S_y$. Again, $S_x$ and $S_y$ are
preferably integers to reduce the computation and to avoid interpolation errors.
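
The selection of a nominal integer shift from a cross-correlation peak can be
sketched as follows. This is a simplified illustration that assumes pure
translations (no rotations), a single image channel stored as nested lists, and
an unnormalized correlation score; the function name, the `limb_pixels`
argument and the toy frames are hypothetical.

```python
def best_integer_shift(frame_t, frame_t1, limb_pixels, max_shift=2):
    """Return the integer (sx, sy) that maximizes the correlation between
    frame_t[y][x] and frame_t1[y - sy][x - sx] over the limb's pixels."""
    height, width = len(frame_t), len(frame_t[0])
    best_shift, best_score = (0, 0), float("-inf")
    for sy in range(-max_shift, max_shift + 1):
        for sx in range(-max_shift, max_shift + 1):
            score = 0.0
            for x, y in limb_pixels:
                xs, ys = x - sx, y - sy
                if 0 <= xs < width and 0 <= ys < height:
                    # Integer indexing only: no interpolation is needed.
                    score += frame_t[y][x] * frame_t1[ys][xs]
            if score > best_score:
                best_shift, best_score = (sx, sy), score
    return best_shift


# Toy 5x5 frames: a two-pixel pattern that moves one pixel to the right.
frame_t = [[0.0] * 5 for _ in range(5)]
frame_t1 = [[0.0] * 5 for _ in range(5)]
frame_t[2][1], frame_t[2][2] = 1.0, 2.0
frame_t1[2][2], frame_t1[2][3] = 1.0, 2.0
shift = best_integer_shift(frame_t, frame_t1, [(1, 2), (2, 2)])
```

Under the indexing used in this sketch, a pattern that moves one pixel to the
right yields a shift of (-1, 0), because the matching sample in the later frame
is read at x - sx.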

Equation 32, with Equations 23, 25, 26, 27, 28 and 30, provides N
brightness and N depth constraints, which can be solved with least squares, on
K + 5 unknowns ($p_0(t+1)$, $R_0(t+1)$, and $\theta_1(t+1)$ through $\theta_{K-1}(t+1)$).
From that solution, Equation 31 provides the non-linear mapping from $\omega_0$ to
$R_0(t+1)$.
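
A least-squares solve of a stacked, over-determined constraint system can be
sketched as below. This is a generic illustration using the normal equations
and Gaussian elimination, not the specific solver of the disclosure; the
function name and the small example system are hypothetical.

```python
def solve_least_squares(A, b):
    """Minimize ||A x - b||^2 via the normal equations (A^T A) x = A^T b."""
    n = len(A[0])
    # Augmented matrix [A^T A | A^T b].
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(A, b))] for i in range(n)]
    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


# Four constraint rows, two unknowns; b is generated from x = (2, -1).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
b = [2.0, -1.0, 1.0, 3.0]
solution = solve_least_squares(A, b)
```

For poorly conditioned systems a QR- or SVD-based solver would normally be
preferred over the normal equations; the sketch only shows the structure of the
problem.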

Camera-centric coordinates are preferably used in the constraint Equation
23. However, any coordinate system can be used that is offset by a known
rotation/translation from the camera coordinate system for the coordinates
$(q_x, q_y, q_z)$ used in Equation 25, to eliminate bias. To improve the conditioning
of the constraint equation, the centroid of the visible figure is chosen as the
origin of $(q_x, q_y, q_z)$.

Any orientation/translation can be used as a reference configuration $g_{sk}(0)$
in Equation 24. The following discussion explains a preferred way to use this
freedom to avoid estimation errors due to cross coupling between the body
position and the body rotation estimates.

Equation 28 includes cross coupling between the estimates for $\dot{p}_0$ and
$\omega_0$ according to the size of $p_0$. Based on the derivations and simulations, for
figures that change orientation relative to the camera, this coupling term should be
present. The desirability of this coupling term is not apparent in simulations until
the figure changes orientation relative to the camera. In simulations, when $p_0$ was
non-zero and the figure was rotating relative to the camera, this resulted in a bias
in the estimates of $\dot{p}_0$. The error resulting from this bias increases linearly over
time, so its effects become more detrimental as the sequence of video images gets
longer.

This bias is avoided by re-parameterizing the twists, at each time step, so
that $p_0 = 0$. This can be done without affecting the coordinate-system origin for
$(q_x, q_y, q_z)$ by adjusting $g_{sk}(0)$, the reference configurations for the limbs (see
Equation 24). This allows the conditioning of the constraints to be improved (as
previously described) while still avoiding coupling.

To remove this coupling without altering the articulated figure's geometry,
$\begin{bmatrix} \omega_i \times R_0^T p_0 \\ 0 \end{bmatrix}$
is subtracted from each internal joint $\xi_i$ ($i \ge 1$). (An offset of
$R_0^T p_0$ is implicitly added to the last column of $g_{sk}(0)$. However, since
$g_{sk}(0)$ never actually appears in the final constraint equations, this offset is
more conceptual than computational.) This maintains the original geometry since, if exp

When these transformations are used to remove the cross-coupling between
$\dot{p}_0$ and $\omega_0$, $p_0$ also needs to be transformed back to the original
coordinate system. This is done by setting
$p_0(t+1) = \dot{p}_0 + R_0(t+1)\, R_0^T(t)\, p_0(t)$.

In accordance with a second aspect of the invention, the dense depth data 
can be employed in conjunction with a shape model to estimate the pose of an
articulated figure. As in the previous approach which employs the constraint 
equations, a figure is tracked by incrementally estimating its pose at regular 
intervals, e.g. for each video frame. In other words, sequential state estimation is 
employed to obtain the figure's pose. 

State estimation is the process of determining a set of parameters for a
given model that best accounts for a set of observations within the realm of that
which can be expected. More specifically, the state estimation finds the most
probable sequence of states given a set of observations and a set of expectations, in
other words the sequence of states with the maximum a posteriori probability.
However, since real image data obtained from the video cameras are subject to
noise and occlusions, the observations will generally have some erroneous or
missing elements. In order to alleviate the effects of these imperfections on the
results, the process of maximizing the agreement of the observations with a model
is refined by considering a priori expectations, such as smooth motions or
bounded accelerations or velocities. As a result, the process is generalized to
minimizing a residual, where the residual takes into account the agreement or
correlation of a proposed state with the observations, as well as some measure of
the unexpectedness or innovation of the proposed state, based on past history.

A more detailed block diagram of the image processing portion of the
system for estimating pose of an articulated figure in accordance with this aspect
of the invention is illustrated in Figure 2. A model 20 is constructed for the figure
whose pose is to be estimated. A state generator 22 produces a hypothesis for the
pose, or state, which is applied to the model to generate an expected input 24.
This expected input is compared with the actual input, i.e. the dense depth data
from the range processor 16, in a comparator 26. The errors between the expected
input and the actual input are then applied to the state generator 22. A new
hypothesized state is then generated, in an effort to reduce the error. This process
continues in a recursive manner, until the residual error has been minimized. At
this point, a set of output parameters 28 are produced, as an estimate of the
figure's pose.

The model 20 consists of a set of connected planar patches, each of which
is in the shape of the convex hull of two circles. Figure 3a illustrates one such
patch, which corresponds to a segment of the figure, e.g. a limb. The radius (r)
and three-dimensional location (x,y,z) of each circle are variable parameters,
which are estimated by the state generator 22. The connectivity of the patches is
fixed, and provided by the user. For example, Figure 3b illustrates two patches
that respectively correspond to two limbs that share a common joint, e.g. the
upper arm and forearm of a person. At the joint, the circles of the two patches
coincide in the two-dimensional image plane of the camera 12a.

The visible surface of each articulated segment of the figure is modeled by
a single patch, such as that shown in Figure 3a. Each surface patch $S_k = S_{(ij)}$
is defined by two nodes $n_i$, $n_j$ at its ends. Each node is specified by four scalar
values $(x_i, y_i, z_i, r_i)$, which indicate its location and size. Given the values for two
adjacent nodes $n_i$ and $n_j$, the connecting model patch $S_{(ij)}$ is a region of a plane with
a range map $R_{(ij)}$ that passes through the two points $(x_i, y_i, z_i)$ and
$(x_j, y_j, z_j)$ but is otherwise parallel to the image plane. Thus, for each segment of
the figure, there corresponds a windowed depth map. The set of these maps over all
segments $S_k$, where k ranges over the pairs (i,j) for which a segment exists, forms
the complete state estimate against which the dense depth data from the processor 16 is
compared.
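
The patch geometry can be illustrated with a short sketch. It relies on the fact
that the convex hull of two discs is the union of the discs whose centers and
radii are linearly interpolated between them, and it assumes that the predicted
range varies linearly along the axis joining the nodes and is constant across
it. The function names and the node values are hypothetical; each node is a
tuple (x, y, z, r).

```python
import math


def _gap(t, pixel, node_a, node_b):
    """Distance from pixel to the center interpolated at t, minus the radius."""
    cx = (1 - t) * node_a[0] + t * node_b[0]
    cy = (1 - t) * node_a[1] + t * node_b[1]
    radius = (1 - t) * node_a[3] + t * node_b[3]
    return math.hypot(pixel[0] - cx, pixel[1] - cy) - radius


def in_patch(pixel, node_a, node_b, iterations=60):
    """True if pixel (x, y) lies in the convex hull of the two node circles."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):            # ternary search on a convex function
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if _gap(m1, pixel, node_a, node_b) < _gap(m2, pixel, node_a, node_b):
            hi = m2
        else:
            lo = m1
    return _gap(0.5 * (lo + hi), pixel, node_a, node_b) <= 0.0


def patch_range(pixel, node_a, node_b):
    """Range predicted by the patch: linear along the axis, constant across it."""
    ax, ay = node_b[0] - node_a[0], node_b[1] - node_a[1]
    length_sq = ax * ax + ay * ay
    if length_sq == 0.0:                   # degenerate patch: coincident nodes
        return node_a[2]
    s = ((pixel[0] - node_a[0]) * ax + (pixel[1] - node_a[1]) * ay) / length_sq
    return node_a[2] + s * (node_b[2] - node_a[2])


shoulder = (0.0, 0.0, 1.0, 1.0)    # hypothetical node (x, y, z, r)
elbow = (10.0, 0.0, 2.0, 2.0)      # hypothetical node (x, y, z, r)
mid_range = patch_range((5.0, 1.4), shoulder, elbow)
```

Pixels for which `in_patch` is true form the window of the segment's depth map,
and `patch_range` supplies the expected range inside that window.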

Within the comparator 26, a determination is made for each pixel u of each
segment $S_k$ of the correlation between the observed and expected inputs. This
correlation is based on the difference between the range I(u) that is observed on the
actual input for the pixel, and the range $R_k(u)$ that is predicted by the segment.
Since there are likely to be errors in both the measurement and the model, the
magnitude of the difference is compared against a finite threshold. Pixels whose
difference values lie above or below the threshold are respectively identified as
being far from or close to the surface. Figure 4 illustrates an example of a
threshold function for observed farness. In this case, a pixel which lies within the
threshold, and is therefore considered to be close to the surface, is assigned an
observed farness value of -1, whereas pixels which are far from the surface have a
value of +1.
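
A threshold function of this kind can be sketched in a few lines. The function
name, the treatment of a difference exactly equal to the threshold, and the
numeric values are assumptions made for illustration.

```python
def observed_farness(observed_range, predicted_range, threshold):
    """Return -1 when the pixel is close to the modeled surface, +1 when far."""
    difference = abs(observed_range - predicted_range)
    return -1 if difference <= threshold else 1


close_pixel = observed_farness(2.03, 2.00, threshold=0.10)
far_pixel = observed_farness(3.50, 2.00, threshold=0.10)
```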

Since the surface patches of the model are planar, they have a uniform
range across their widths, i.e., in a direction perpendicular to an axis intersecting
the two nodes. However, for many three-dimensional objects, the measured range
will vary across the width of the object, due to its thickness. This discrepancy
could result in erroneous correlation results. Therefore, as a second measure, the
correlation between the observed input and the estimated input is confined to an
area near the boundary of a segment. A true segment boundary has pixels of the
segment inside the boundary, and pixels pertaining to other structure outside the
boundary. Hence, the depth value for pixels that are just inside the boundary
should fit the segment model, and the depth of pixels just outside the boundary
would not be expected to fit the model. Figure 5 illustrates a function that can be
employed for the expected farness of a pixel, based on these considerations. The
abscissa of this graph represents the shortest distance from the pixel of interest (u)
to the boundary of the segment. Hence, pixels which lie just outside the boundary
have an expected farness value of +1, and those which lie just inside the boundary
have an expected farness value of -1. The function goes to 0 at a predetermined
distance from the boundary, since only those pixels which are relatively close to
the boundary are of interest.

The correlation between the observed depth data and the expected state
therefore comprises a sum over all segments, and over all pixels of a segment, of
the product of the observed farness and the expected farness.
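
The following sketch shows one way such a correlation could be accumulated for a
single segment. The exact shape of the expected-farness function of Figure 5 is
not reproduced here; a simple step with a cutoff band is assumed instead, with
the signed distance taken as positive outside the boundary and negative inside.
All names and values are hypothetical.

```python
def expected_farness(signed_distance, band):
    """+1 just outside the boundary, -1 just inside, 0 beyond the band."""
    if signed_distance == 0 or abs(signed_distance) > band:
        return 0
    return 1 if signed_distance > 0 else -1


def segment_correlation(pixels, band):
    """Sum observed * expected farness over (observed, signed_distance) pairs."""
    return sum(observed * expected_farness(distance, band)
               for observed, distance in pixels)


# (observed farness, signed distance to the segment boundary) for five pixels.
pixels = [(-1, -1.0), (-1, -2.0), (1, 1.5), (1, -0.5), (-1, -9.0)]
score = segment_correlation(pixels, band=3.0)
```

Pixels that agree with the expectation add to the score, pixels that disagree
subtract from it, and pixels far from the boundary contribute nothing.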

To determine the estimate, the state generator 22 selects the set of
two-dimensional parameters $(x_i, y_i, r_i)$ for which the difference between the observed
and estimated depth values, summed over all of the pixels for all of the figure
segments, is a minimum. Preferably, this minimum is determined by using a least
squares fit. The various hypotheses can be generated, for example, by starting
with the state that was estimated for the previous frame, and then proposing new
hypothesized states which have configurations that are close to those of the
previous estimate. Since the pose estimates are updated at a fairly high rate, e.g.,
video frame rates, the object of interest is not expected to change size or position
significantly from one estimate to the next. This can be taken into account during
the generation of new hypotheses, by weighting those which are closest to the most
recent estimate as more likely than those which are farther away.
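
One possible form of such a hypothesis search is sketched below. It is not the
procedure of the disclosure: it assumes a simple coordinate-wise pattern search
that starts from the previous estimate and adds a quadratic penalty on the
distance from that estimate. The function name, the `prior_weight` parameter
and the toy residual are hypothetical.

```python
def refine_state(previous_state, residual, step=1.0, prior_weight=0.1,
                 min_step=1e-3):
    """Search near previous_state for a state with a lower penalized residual."""
    def cost(state):
        drift = sum((a - b) ** 2 for a, b in zip(state, previous_state))
        return residual(state) + prior_weight * drift

    best = list(previous_state)
    best_cost = cost(best)
    while step >= min_step:
        improved = False
        for index in range(len(best)):
            for delta in (step, -step):
                candidate = list(best)
                candidate[index] += delta      # propose a nearby hypothesis
                candidate_cost = cost(candidate)
                if candidate_cost < best_cost:
                    best, best_cost, improved = candidate, candidate_cost, True
        if not improved:
            step *= 0.5                        # refine the search scale
    return best, best_cost


target = (3.0, 4.0, 2.0)   # hypothetical (x, y, r) that best fits the depth data


def toy_residual(state):
    return sum((s - t) ** 2 for s, t in zip(state, target))


previous = (2.5, 3.5, 2.0)
estimate, estimate_cost = refine_state(previous, toy_residual)
```

The penalty pulls the result slightly toward the previous estimate, so the
returned state lies between the previous state and the state that minimizes the
residual alone.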

From the foregoing, therefore, it can be seen that the present invention
provides a mechanism for estimating the pose of an object from dense depth data.
Of particular significance, the techniques of the present invention can be employed
to estimate the pose of articulated objects, as well as solid objects. In one
approach, the pose is estimated on the basis of the depth data alone, through the
use of a model consisting of planar patches. A second approach dispenses with the
need for a model, by using both depth and brightness data, in conjunction with
appropriate constraint equations.

It will be appreciated by those of ordinary skill in the art that the present
invention can be embodied in other specific forms without departing from the
spirit or essential characteristics thereof. The presently disclosed embodiments are
therefore considered in all respects to be illustrative, and not restrictive. The
scope of the invention is indicated by the appended claims, rather than the
foregoing description, and all changes that come within the meaning and range of
equivalence thereof are intended to be embraced therein.

What is claimed is:

1. A method for estimating the pose of an articulated figure, 
comprising the steps of: 

obtaining dense range data which describes the distance of points on 
the figure from a reference; and 

processing said dense range data to estimate the pose of the figure. 

2. The method of claim 1 wherein the dense range data is processed in 
accordance with a set of depth constraints to estimate the pose. 

3. The method of claim 2 wherein said depth constraints are linear. 

4. The method of claim 2 further including the steps of obtaining 
brightness data from an image of the figure, and processing said brightness data in 
accordance with a set of linear brightness constraints to estimate the pose. 

5. The method of claim 2 wherein said depth constraints are 
represented by means of twist mathematics. 

6. The method of claim 1 wherein said dense range data is compared 
with an estimate of pose to produce an error value, and said estimate is iteratively 
revised to minimize said error. 

7. The method of claim 6 wherein the estimate of pose is generated
with reference to a model of the figure.

8. The method of claim 7 wherein said model comprises a set of planar
patches which respectively correspond to segments of the articulated figure.

9. The method of claim 8 wherein each patch comprises the convex 
hull of two circles. 

10. The method of claim 1 further including the steps of obtaining 
brightness data from an image of the figure, and processing said brightness data in 
accordance with a set of brightness constraints to estimate the pose. 

11. A method for estimating the pose of an object, comprising the steps of:

obtaining dense range data which describes the distance of points on 
the object from a reference; and 

processing said dense range data in accordance with a set of linear 
depth constraints to estimate the pose of the object. 

12. The method of claim 11 wherein the object is articulated. 

13. The method of claim 12 wherein said depth constraints are 
represented by means of twist mathematics. 

14. The method of claim 13 further including the steps of mapping 
parameters which describe rotation and translation of the object to a set of linear 
parameters, solving for the depth constraints, and re-mapping back to the original 
parameters to provide a pose estimate.

15. The method of claim 11 further including the steps of
obtaining brightness data from an image of the object, and processing said
brightness data in accordance with a set of linear brightness constraints to estimate
the pose.

16. The method of claim 11, wherein an estimate of the pose of the 
object includes an estimate for each of the orientation and translational positions of 
the object, and further including the steps of decoupling the estimate of orientation 
from the estimate of translational position. 

17. The method of claim 12, wherein said reference comprises a 
location on the object, and the pose is estimated, at least in part, from the positions 
of points on the object relative to said location. 

18. The method of claim 11, wherein the pose of the object is estimated 
for each image in a sequence of images, and further including the step of selecting 
a rigid translation value for each point on the object from one image to the next. 

19. The method of claim 18, wherein said rigid translation value is an 
integer value. 

20. The method of claim 18, wherein the rigid translation values are 
different for different points on the object. 

21. A method for estimating the pose of an object appearing in a
sequence of video images, comprising the steps of:

obtaining dense brightness data for pixels in each of said video images;

obtaining dense range data for pixels in each of said video images;

determining an initial pose for the object in one of said video images; and

estimating changes in at least one of the translational position and 
rotational orientation of the object for successive images, on the basis of said 
brightness data and said range data, to thereby estimate the pose of the object in 
successive images. 

22. The method of claim 21, wherein the object is an articulated object. 

23. The method of claim 21, wherein said estimates are obtained by
means of linear constraint equations that are applied to said brightness data and
said range data.

24. A method for estimating the pose of an articulated object appearing 
in a sequence of video images, comprising the steps of: 

establishing a model for the surfaces of the articulated object; 

obtaining dense range data for pixels in each of said video images; 

generating a hypothetical pose for the object and determining the 
correlation of the hypothetical pose to the range data for an image; and 

recursively generating successive hypothetical poses and 
determining the correlation of each hypothetical pose to identify the pose having 
the closest correlation to the range data.

25. The method of claim 24, wherein the estimate of pose is generated
with reference to a model of the figure.

26. The method of claim 25, wherein said model comprises a set of 
planar patches which respectively correspond to segments of the articulated figure. 



27. The method of claim 25, wherein the correlation is determined with 
respect to pixels in an image that are located within a predetermined distance of 
edges of said planar patches.

ABSTRACT OF THE DISCLOSURE



Dense range data obtained at real-time rates is employed to estimate the 
pose of an articulated figure. In one approach, the range data is used in 
combination with a model of connected patches. Each patch is the planar convex 
hull of two circles, and a recursive procedure is carried out to determine an 
estimate of pose which most closely correlates to the range data. In another aspect 
of the invention, the dense range data is used in conjunction with image intensity 
information to improve pose tracking performance. The range information is used 
to determine the shape of an object, rather than assume a generic model or 
estimate structure from motion. In this aspect of the invention, a depth constraint 
equation, which is a counterpart to the classic brightness change constraint 
equation, is employed. Both constraints are used to jointly solve for motion 
estimates.

the first paragraph of Title 35, United States Code, §112, I acknowledge the duty to disclose to the Office all information
known to me to be material to the patentability as defined in Title 37, Code of Federal Regulations §1.56, which became 
available between the filing date of the prior application(s) and the national or PCT international filing date of this 
application: 


PRIOR U.S. APPLICATIONS OR PCT INTERNATIONAL APPLICATIONS DESIGNATING THE U.S. FOR BENEFIT UNDER 35 U.S.C. §120:

U.S. APPLICATIONS

U.S. APPLICATION NUMBER    U.S. FILING DATE    STATUS (check one): PATENTED / PENDING / ABANDONED

PCT APPLICATIONS DESIGNATING THE U.S.

PCT APPLICATION NO.    PCT FILING DATE    U.S. APPLICATION NUMBERS ASSIGNED (if any)

I hereby appoint the following attorneys and agent(s) to prosecute said application and to transact all business in the Patent 
and Trademark Office connected therewith and to file, prosecute and to transact all business in connection with 
international applications directed to said invention: 



William L. Mathis            17,337
R. Danny Huntington          27,903
Gerald F. Swiss              30,113
Robert S. Swecker            19,885
Eric H. Weisblatt            30,505
Michael J. Ure               33,089
Platon N. Mandros            22,124
James W. Peterson            26,057
Charles F. Wieland III       33,096
Benton S. Duffett, Jr.       22,030
Teresa Stanek Rea            30,427
Bruce T. Wieder              33,815
Norman H. Stepno             22,716
Robert E. Krebs              25,885
Todd R. Walters              34,040
Ronald L. Grudziecki         24,970
William C. Rowland           30,888
Ronni S. Jillions            31,979
Frederick G. Michaud, Jr.    26,003
T. Gene Dillahunty           25,423
Harold R. Brown III          36,341
Alan E. Kopecki              25,813
Patrick C. Keane             32,858
Allen R. Baum                36,086
Regis E. Slutter             26,999
Bruce J. Boggs, Jr.          32,344
Steven M. du Bois            35,023
Samuel C. Miller, III        27,360
William H. Benz              25,952
Brian P. O'Shaughnessy       32,747
Robert G. Mukai              28,531
Peter K. Skiff               31,917
Kenneth B. Leffler           36,075
George A. Hovanec, Jr.       28,223
Richard J. McGrath           29,195
Fred W. Hathaway             32,236
James A. LaBarre             28,632
Matthew L. Schneider         32,814
E. Joseph Gess               28,510
Michael G. Savage            32,596





Address all correspondence to:

21839
James A. LaBarre
Burns, Doane, Swecker & Mathis, L.L.P.
P.O. Box 1404
Alexandria, Virginia 22313-1404

Address all telephone calls to: James A. LaBarre at (703) 836-6620.

I hereby declare that all statements made herein of my own knowledge are true and that all statements made on information
and belief are believed to be true; and further that these statements were made with the knowledge that willful false
statements and the like so made are punishable by fine or imprisonment, or both, under Section 1001 of Title 18 of the United
States Code and that such willful false statements may jeopardize the validity of the application or any patent issued thereon.

FULL NAME OF SOLE OR FIRST INVENTOR: Michele M. Covell
SIGNATURE:
DATE:
RESIDENCE: Los Altos Hills, California
CITIZENSHIP: United States of America
POST OFFICE ADDRESS: 12121 Page Mill Road, Los Altos Hills, California 94022, United States of America

FULL NAME OF SECOND JOINT INVENTOR, IF ANY: Michael Hongmai Lin
SIGNATURE:
DATE:
RESIDENCE: Stanford, California
CITIZENSHIP:
POST OFFICE ADDRESS: Stanford, California

FULL NAME OF THIRD JOINT INVENTOR, IF ANY: Ali Rahimi
SIGNATURE:
DATE:
RESIDENCE: Belmont, California
CITIZENSHIP:
POST OFFICE ADDRESS: 2751 Belmont Canyon Road West, Belmont, California 94002, United States of America

FULL NAME OF FOURTH JOINT INVENTOR, IF ANY: Michael Harville
SIGNATURE:
DATE:
RESIDENCE: Palo Alto, California
CITIZENSHIP: United States of America
POST OFFICE ADDRESS: Post Office Box 60181, Palo Alto, California 94306, United States of America

FULL NAME OF FIFTH JOINT INVENTOR, IF ANY: Trevor J. Darrell
SIGNATURE:
DATE:
RESIDENCE: San Francisco, California
CITIZENSHIP: United States of America
POST OFFICE ADDRESS: 64 Guy Place, San Francisco, California 94105, United States of America

FULL NAME OF SIXTH JOINT INVENTOR, IF ANY: John I. Woodfill
SIGNATURE:
DATE:
RESIDENCE: Palo Alto, California
CITIZENSHIP: United States of America
POST OFFICE ADDRESS: 777 Rhode Island #3, San Francisco, California 94107, United States of America

FULL NAME OF SEVENTH JOINT INVENTOR, IF ANY: Harlyn Baker
SIGNATURE:
DATE:
RESIDENCE: Los Altos, California
CITIZENSHIP: United States of America
POST OFFICE ADDRESS: 414 Paco Drive, Los Altos, California 94024, United States of America

FULL NAME OF EIGHTH JOINT INVENTOR, IF ANY: Gaile G. Gordon
SIGNATURE:
DATE:
RESIDENCE: Palo Alto, California
CITIZENSHIP: United States of America
POST OFFICE ADDRESS: 4237 Manuela Avenue, Palo Alto, California 94306, United States of America